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Abstract 

We propose a method for predicting splice graphs that enhances curated gene models using evidence from RNA- 
Seq and EST alignments. Results obtained using RNA-Seq experiments in Arobidopsis tholiono show that predictions 
made by our SpliceGrapher method are more consistent with current gene models than predictions made by TAU 
and Cufflinks. Furthermore, analysis of plant and human data indicates that the machine learning approach used 
by SpliceGrapher is useful for discriminating between real and spurious splice sites, and can improve the reliability 
of detection of alternative splicing. SpliceGrapher is available for download at http://SpliceGrapher.sf.net. 



Background 

Deep transcriptome sequencing (RNA-Seq) with next- 
generation sequencing (NGS) technologies is providing 
unprecedented opportunities for researchers to probe 
the transcriptomes of many species [1-5]. An important 
goal in these studies is to assess the extent of alternative 
splicing (AS), a process that increases transcriptome and 
proteome diversity, and plays a key role in regulating 
gene expression and protein function [6,7]. Although it 
is inexpensive and easy to obtain whole transcriptome 
data using RNA-Seq, one limitation has been the lack of 
versatile methods to analyze these data. Consequently, 
there is an increasing demand for methods that can use 
the short reads produced in these studies to predict pat- 
terns of AS. 

The sequences produced by NGS methods have char- 
acteristics that complicate the task of identifying the 
mRNA transcripts represented in a sample. A sequen- 
cing read may consist of fewer than 40 nucleotides, 
making it difficult to identify a unique origin within a 
reference sequence. In addition, NGS base-call error 
rates tend to increase with read length, raising the 
chance of a mismatch when aligning a read to a refer- 
ence sequence [8]. These ambiguities are exacerbated by 
the presence of paralogous genes that can give rise to 

Correspondence: asa@cs.colostate.edu 
^Department of Computer Science, 1873 Campus Delivery, Colorado State 
University, Fort Collins, CO 80523-1873, USA 
Full list of author information is available at the end of the article 



reads that align well in multiple locations. Much of the 
work on analyzing NGS reads has focused on aligning 
reads within exonic regions, and many methods exist for 
the problem of aligning reads without gaps-for example, 
MAQ [9], PASS [10] and BowTie [11]. 

Reads that span splice junctions introduce additional 
challenges. A splice junction may occur anywhere within 
a short read, so the read may have just a few bases on 
one side of a junction. Such a short sequence may align 
in multiple locations, making it difficult to identify its 
true origin. One can use heuristics to restrict the num- 
ber of candidate locations: for example, by establishing 
limits for permissible intron lengths, or by focusing on 
locations that are bounded by canonical GT-AG or GC- 
AG splice-site dimers. Several spliced alignment algo- 
rithms exist that use these and other approaches to 
identify unique alignments for spliced reads [12-17]. 

The first studies that used RNA-Seq data to predict 
AS focused on exon-skipping events, the most prevalent 
form of AS in mammals (see, for example, [1,18-21]). 
To identify splice junctions recapitulated in short read 
data, these studies used exon sequences flanking anno- 
tated splice sites to produce a database of splice junction 
sequences. Using novel combinations of known acceptor 
and donor sites, researchers can create a database that 
consists of both known and putative splice junction 
sequences. RNA-Seq reads that align to these putative 
sequences then provide evidence for novel splicing 
events. 
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In this work we compare our approach with two 
methods: Cufflinks [22], and TAU [3]. Both methods 
were originally designed for de novo splice form predic- 
tion, but can make limited use of existing annotations. 
TAU predicts splice forms by assembling for each gene 
all feasible combinations of exons that have been identi- 
fied in the alignment and spliced-alignment step. This 
approach ensures that no splice form will be overlooked, 
but it can produce a large number of transcripts that 
make it difficult to identify the most realistic predic- 
tions. Cufflinks uses a more sophisticated approach that 
seeks to identify the smallest set of mRNA transcripts 
that explain the observed data [22]. The Scripture 
method constructs a transcript graph on the basis of 
regions that exhibit statistically significant read depth 
compared to genomic background and predicts splice 
forms by considering all paths through the graph, an 
approach similar to TAU's [23]. Methods that perform 
de novo transcriptome assembly without requiring a 
reference genome-for example. Trinity [24] and ABySS 
[25] -are not included in our comparison. 

To address ambiguities that inevitably arise when 
using short reads, and to take advantage of a rich and 
fast-evolving body of transcriptome data, we developed 
SpliceGrapher, a Python-based scripting tool designed to 
leverage gene annotations and ESTs, in addition to NGS 
reads. SpliceGrapher predicts 'splice graphs' (Figure 1), 
which capture in a single structure all the ways in which 
exons for a given gene may be assembled [26-32]. In 
these graphs, exons are depicted as nodes and introns 
are the edges that connect them. The compact structure 
allows researchers to visualize AS easily; and further- 
more, we argue that NGS data rarely support the pre- 
diction of novel splice forms unambiguously. The splice 
graph structure permits SpliceGrapher to evaluate NGS 
data in the context of existing annotations, facilitates 
automated statistical analysis of AS events [31], and aids 
in comparing AS behaviour between gene families [32]. 

A few tools, such as Sircah, can produce splice graphs 
from conventional EST or cDNA alignments [29]. In 
previous work we enhanced Sircah's AS detection rules 
and extended the package to provide statistics and pro- 
tein predictions for genome-wide studies of AS based 
on EST data [31]. With SpliceGrapher we extend this 
idea by incorporating multiple forms of data, including 
RNA-Seq reads, into splice graph predictions. 

SpliceGrapher was designed from the outset to inte- 
grate RNA-Seq data, annotated gene models and EST 
alignments to produce comprehensive splice graph pre- 
dictions. It applies inference rules to generate predic- 
tions that are compatible with all available evidence. 
The package includes flexible visualization tools that can 
depict splice graphs along with the evidence used to 
predict them. We compare SpliceGrapher's predictions 



with splice graphs generated from TAU and Cufflinks 
output based on RNA-Seq data and find that our results 
are more consistent with existing evidence from curated 
gene models. 

Results and discussion 

Splice graph prediction pipeline 

SpliceGrapher is designed to enhance existing gene 
models using RNA-Seq and EST data. Relying on exist- 
ing annotations as a baseline provides SpliceGrapher 
with a context in which to interpret short-read data. 
SpliceGrapher s splice graph prediction pipeline consists 
of the following steps (Figure 2): ungapped alignment of 
short reads to the reference genome, spliced alignment 
of reads that did not align in the first step, initial splice 
graph construction from the annotated gene models, 
assembly of exons from the ungapped short-read align- 
ments, and insertion of the new exons into the splice 
graph using spliced alignments. SpliceGrapher accepts 
as input EST alignments as well; these are interpreted as 
splice graphs that SpliceGrapher merges with its gene 
model baseline graphs. 

Figure 1 provides an example of a splice graph that 
SpliceGrapher predicted for a gene in Arabidopsis thali- 
ana, along with the graph constructed from the gene 
model, the splice junctions that were recapitulated in 
the RNA-Seq data, and the read depth along the geno- 
mic region. SpliceGrapher combined these different 
forms of evidence and predicted only those novel spU- 
cing events it could resolve unambiguously. 

We first present results on RNA-Seq data from two 
plant genomes; applicability to mammalian genomes is 
demonstrated in a later section. We ran SpliceGrapher 
on short-read data from A. thaliana [3] and Vitis vini- 
fera [33]. In Arabidopsis we provided SpliceGrapher 
with The Arabidopsis Information Resource (TAIR) ver- 
sion 9 of the genome annotations. New splice forms 
annotated in TAIR version 10 were reserved for asses- 
sing the ability of SpliceGrapher to detect novel annota- 
tions. SpliceGrapher predicted nearly 1,500 AS events 
that were not present in the TAIR9 version of the A. 
thaliana genome annotations. In V, vinifera, whose gene 
models showed no AS, SpliceGrapher predicted more 
than 2,600 events. The breakdown of those events by 
type (intron retention, exon skipping, and alternative 5' 
or 3') is found in Table 1. In plants, intron retention 
(IR) is the most prevalent form of AS [34,35] and 
accounts for more than 30% of AS events in the A. 
thaliana gene models (see also [35]). The novel exon- 
skipping (ES) and alternative 5' events predicted in A. 
thaliana are in proportions similar to those found in 
the gene models (Table 1), while alternative 3' events 
are higher. Notably, just 21% of the novel events are IR, 
compared to 33% in the gene models. IR events are 
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Figure 1 Example of a predicted splice graph in A thaliana. RNA-Seq alignment data were loaded along with gene model annotations to 
create a composite model that incorporates all available evidence. SpliceGrapher's visualization modules produce color-coded graphs based on 
the color scheme used by Sircah [29] that makes it easy to see exons and introns involved in AS events. RNA-Seq read coverage across one of 
the introns was sufficient to allow SpliceGrapher to identify an intron retention event (exon outlined in blue). In addition, a novel splice junction 
(highlighted in green) provided SpliceGrapher with evidence for an alternative 3' splicing event (highlighted in orange). The numbers associated 
with splice junctions indicate the number of reads that align across it. Vertical bands in the background depict exon boundaries in the original 
gene model. 



more difficult to predict from RNA-Seq data because composed of longer 76-nucleotide reads in A. thaliana; 

prediction depends on having read coverage across an despite having less reads (41 million compared to 284 

intron's full length (Figure 3 and Materials and meth- million), SpliceGrapher was able to find close to 2,400 

ods). We have also run SpliceGrapher on a dataset novel AS events in the new dataset, compared to close 
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Figure 2 Splice graph prediction pipeline. SpliceGrapher predicts splice graphs using information from gene models, EST alignments and 
RNA-Seq data. RNA-Seq exonic alignments may be performed using any popular short-read alignment tool. RNA-Seq spliced alignments may be 
performed using a conventional short-read mapping tool with a database of splice junctions predicted by SpliceGrapher, or they may be 
performed using short-read spliced-alignment programs such as TopHat, followed by filtering using SpliceGrapher's database of predicted splice 
sites. SpliceGrapher incorporates all of this information to produce a comprehensive splice graph prediction. 



to 1,500 in the 32-nucleotide read data. Complete statis- 
tics for the new data are provided in Table SI in Addi- 
tional file 1. In V. vinifera the rate of IR prediction is 
even lower, at 13%. This can be explained by the longer 
introns in V. vinifera, where the average intron length 
we found in the annotated genes is 969 nucleotides 
(compared with 170 nucleotides for A, thaliana). The 
intron length influences the predicted IR rate for V, 
vinifera by making prediction of IR from RNA-Seq data 
more difficult. Furthermore, V. vinifera may have a 
lower IR rate to begin with, since low IR rates are 
observed in species with long introns [36]. 

Comparison with TAU and Cufflinks 

To see how SpliceGrapher's AS predictions compared 
with other tools, we ran the HashMatch/Supersplat/ 
TAU and TopHat/Cufflinks pipelines, both with and 



without gene models, using the same RNA-Seq data as 
SpUceGrapher, and assembled spUce graphs from their 
transcript predictions. We note that Cufflinks was 
designed for mammalian genomes: its assembly algo- 
rithm includes heuristics that are based on the charac- 
teristics of human and mouse transcripts [22] . TAU was 
designed to analyze the A, thaliana data we consider 
here [3]. 

Our first step in comparing SpliceGrapher to Cufflinks 
and TAU was to test the ability of each package to pre- 
dict given and novel annotations. This was quantified by 
computing the fraction of exons and introns in the 
annotations that were predicted by each package (Equa- 
tion 1 in the Materials and methods section); this is the 
'recall' at the exon/intron level. These statistics are sum- 
marized in Table 2. First we discuss the rate at which 
each package is able to recall TAIR9 annotations. 
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Table 1 Alternative splicing events 





AS 






AS events 








genes 


IR 


ES 


Alt. 5' 


Alt. 3' 


Total 


A. tholiana models 


4,029 


1,987 (33%) 


550 (9%) 


1,256 (21%) 


2,145 (36%) 


5,938 


SpliceGrapher 














No ESTs 


4,901 


2,248 (30%) 


714 (10%) 


1,560 (21%) 


2,866 (39%) 


7,388 


Novel 


885 


308 (21%) 


164 (11%) 


304 (20%) 


721 (48%) 


1,497 


With ESTs 


6,162 


3,658 (33%) 


994 (9%) 


2,335 (21%) 


4,128 (37%) 


9,916 


Novel 


2,154 


1,779 (34%) 


444 (8%) 


1,079 (20%) 


1,983 (38%) 


5,285 


Cufflinks 














No gene models 


1,263 


449 (32%) 


383 (28%) 


237 (17%) 


319 (23%) 


1,388 


Novel 


699 


429 (32%) 


380 (28%) 


232 (17%) 


304 (23%) 


1,345 


With gene models 


6,056 


4,029 (39%) 


2,857 (27%) 


1^427 (14%) 


2,106 (20%) 


10,419 


Novel 


2,319 


2,232 (38%) 


2,550 (43%) 


552 (9%) 


594 (10%) 


5,928 


TAU 














No gene models 


2,777 


893 (17%) 


475 (9%) 


1,481 (27%) 


2,555 (47%) 


5,404 


Novel 


1,591 


81 1 (16%) 


460 (9%) 


1,431 (28%) 


2,351 (47%) 


5,053 


With gene models 


10,458 


94,571 (85%) 


598 (1%) 


5,972 (5%) 


9,820 (9%) 


1 10,961 


Novel 


8,364 


94,124 (86%) 


476 (0%) 


5,697 (5%) 


9,219 (8%) 


109,516 


V. vinifera models 


0 


0 (0%) 


0 (0%) 


0 (0%) 


0 (0%) 


0 


SpliceGrapher 


2,039 


347 (13%) 


830 (31%) 


640 (24%) 


838 (32%) 


2,655 


TAU 














No gene models 


3,099 


531 (10%) 


684 (13%) 


1,321 (25%) 


2,743 (52%) 


5,279 


With gene models 


15,874 


135,585 (72%) 


4,938 (3%) 


23,615 (13%) 


24,406 (13%) 


188,544 


Cufflinks 














No gene models 


1,057 


324 (24%) 


519 (39%) 


140 (11%) 


349 (26%) 


1,332 


With gene models 


4,263 


4,120 (34%) 


3,148 (26%) 


2,165 (18%) 


2,818 (23%) 


12,251 



The number of AS events detected by SpliceGrapher, Cufflinks, and TAU compared with events inferred from the TAIR9 annotations. We track the following AS 
event types: intron retention (IR), exon skipping (ES), alternative 5' sites (Alt. 5') and alternative 3' sites (Alt. 3'). SpliceGrapher uses the TAIR9 gene models as a 
baseline, so it includes all of the same AS events along with additional events inferred from RNA-Seq data. Without gene models, nearly all TAU and Cufflinks 
predictions are novel AS events. With gene models, more than half of Cufflinks predictions reproduce AS events from the gene models. TAU uses known splice 
sites to predict all possible exons in a gene, generating vast numbers of novel exons and novel IR events. 



Without gene models, Cufflinks and TAU achieve simi- 
lar levels of recall, which are below 0.29. This illustrates 
the difficulty of predicting transcriptional activity from 
NGS data alone. Recall improves dramatically when we 
provide Cufflinks and TAU with gene models: for Cuf- 
flinks it increases to 0.94 at the exon level, and to 0.89 
at the intron level; for TAU it increases to 0.69 at the 
exon level, and 0.79 at the intron level. Since SpliceGra- 
pher uses the input annotations as a baseline, it achieves 
perfect recall on TAIR9; Cufflinks and TAU sometimes 
ignore the annotations that are provided as input, which 
is a result of how they use annotations in their algo- 
rithms. Cufflinks, for example, breaks down annotated 
exons into pseudo-reads that are incorporated into its 
algorithm; and since it looks for a minimal set of tran- 
scripts that explain the observed read depth, the annota- 
tions can be ignored. 

When TAU and Cufflinks use only RNA-Seq data, the 
transcripts that they predict are sometimes fragmented 
as a result of gaps in short-read coverage (Figure 4; Fig- 
ures SI to S3 in Additional file 1). For Cufflinks the 



number of genes for which it made predictions 
increased significantly when provided with gene models: 
from 10,277 genes to 28,277 in A. thaliana and from 
10,022 to 23,634 in V. vinifera. For TAU the increase 
was not so dramatic, and when given gene models we 
have observed that it tends to predict what we believe 
are overly complex models. Figure S2 in Additional file 
1 is a representative example. 

None of the packages was able to predict more than 
8% of the new annotations that were added in the 
TAIRIO version of the Arabidopsis genome annotations 
(see Table 2 for details). SpliceGrapher achieved better 
recall than Cufflinks-with or without annotations-despite 
making fewer overall predictions of new exons and 
introns; this is a strong indication that SpliceGrapher's 
predictions have fewer false positives. Surprisingly, the 
recall level for Cufflinks with gene models was lower 
than without them. SpliceGrapher also achieved better 
recall than TAU without annotations. With gene mod- 
els, TAU achieved the highest recall at the exon and 
intron levels, but we believe it is a result of a severe 



Rogers et al. Genome Biology 2012, 13:R4 
http://genomebiology.com/201 2/1 3/1 /R4 



Page 6 of 1 7 



Gene Model for AT1G27920 



SpliceGrapher AT1G27920 Prediction 




Spliced Reads 



k-^H l-H I'^'l 



Read Coverage 



a 

a; 

5 10° 



9727000 



9727500 



9728000 



9728500 



9729000 



9729500 



Alt. 5' 



Figure 3 Ambiguities in RNA-Seq data. This figure demonstrates ambiguities tliat arise in RNA-Seq data tliat make isoform prediction 
cliallenging. Because tliere is read coverage across several introns, SpliceGraplier is not able to determine whether this is a result of a single 
intron retention event, or several independent events. 



over-annotation that was discussed earlier. At the tran- 
script level SpliceGrapher performed best. Its splice 
graphs were consistent with 28 new splice forms, com- 
pared with 4 for Cufflinks and 11 for TAU when pro- 
vided with TAIR9 annotations. Both packages were 
unable to correctly predict any transcript without using 
the gene models. In Additional file 1 we compare Cuf- 
flinks and SpliceGrapher predictions to results we have 
recently obtained using a curated set of full-length 
cDNAs and ESTs for SR genes in Arabidopsis [32]. 



Alternative splicing predictions 

SpliceGrapher predicts AS events in proportions that 
more closely match the gene models than TAU and 
Cufflinks (Table 1). TAU uses known splice sites to pre- 
dict all possible putative exons for a gene, resulting in a 
vast number of IR events that represent 85% of its AS 
predictions (Figure S2 in Additional file 1 demonstrates 
the issue). There is a noticeable discrepancy in the rate 
at which Cufflinks detects exon skipping events. Exon 
skipping accounts for 9% of the AS events in the gene 
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Table 2 Recall of TAIR9 and TAIR10 annotations 

TAIR9 TAIR10 



Recall Novel Transcripts Recall 



Method 


Exons 


Introns 


Exons 


Introns 


Number 


Percentage 


Exons 


Introns 


Splicegrapher 


1.00 


1.00 


1,428 


1,282 


28 


1.4% 


0.039 


0.045 


Splicegrapher + EST 


1.00 


1.00 


11,299 


3,557 


38 


1.9% 


0.050 


0.056 


Cufflinks 


















No gene models 


0.25 


0.19 


33,252 


3,425 


0 


0.0% 


0.035 


0.017 


With gene models 


0.94 


0.89 


12,690 


5,222 


4 


0.2% 


0.017 


0.008 


TAU 


















No gene models 


0.21 


0.29 


86,346 


5,335 


0 


0.0% 


0.043 


0.029 


With gene models 


0.69 


0.79 


115,130 


3,734 


11 


1.1% 


0.079 


0.074 



Comparison of the ability of SpliceGrapher, Cufflinks, and TAU to predict TAIR9 and TAIR10 annotations in A thaliona. The columns of the table provide recall 
levels of TAIR9 annotations, the number of novel exons and introns that are predicted, i.e., are not in the TAIR9 annotations, the number and percentage of 
TAIR10 transcripts that are predicted, and the recall level of TAIR10 annotations at the exon and intron level. When TAU and Cufflinks rely on RNA-Seq data alone 
they tend to produce graphs that are missing many of the features found in the gene models, as reflected in their recall scores between 0.19 and 0.29. When we 
provide them with TAIR9 annotations their recall scores improve, though TAU's improved statistics result from a vast number of novel exons, many of which may 
be spurious. When comparing these predictions with novel splice forms in the TAIR10 gene models, SpliceGrapher predicts more novel splice forms correctly 
than the other packages. 



models, whereas 43% of the novel AS events predicted 
by Cufflinks with gene models are ES. Detection of an 
exon skipping event requires a novel splice junction, 
and therefore depends on accurate splice junction detec- 
tion. Both TAU and Cufflinks ignore the sequence char- 
acteristics of putative junctions, a practice that can 
easily lead to false positives. SpliceGrapher uses splice 
site classifiers to ensure the accuracy of reads that span 
splice junctions. We believe that the lower rate of exon 
skipping detected by SpliceGrapher is a result of better 
control of the quality of splice junction alignments. 
Below we discuss this issue in detail. 

Splice junction reads 

We compared splice-junction read predictions made by 
the three packages. We restricted this comparison to 
junctions within known genes that were bounded by 
canonical GT-AG or GC-AG splice sites. Accurate 
splice junction alignments are crucial to the quality of 
AS predictions. This is illustrated in Figures S4 to S8 in 
Additional file 1. SpliceGrapher uses its splice-site classi- 
fiers to predict donor and acceptor sites that are then 
used to build a database of putative splice junction 
sequences. Short reads that align to these sequences 
provide evidence of splicing events recapitulated in the 
RNA-Seq data. Supersplat performs spliced alignment 
by attempting to map each end of a read to locations in 
the genome. It accepts alignments where a read's ends 
match genomic sequences with 100% identity and the 
inferred intron length is within specified limits. TopHat 
maps spliced reads by splitting the reads into segments, 
aligning the segments to genomic sequences, and 
accepting spliced alignments when they infer introns 
that are bounded by canonical GT-AG splice-site dimers 
and have lengths within specified limits. 



Although the three packages find similar numbers of 
reads that span splice junctions, there is a sizable discre- 
pancy in the number of novel splice junctions that were 
detected. For example, TopHat detected 14,572 novel 
splice junctions in A. thaliana. Only 1,982 of those were 
also detected by SpliceGrapher. Out of the 12,590 that 
were not detected by SpliceGrapher, 9,942 contained a 
putative splice site that was classified as a false-positive 
by our splice-site classifiers. Our classifiers achieve very 
high accuracy in A. thaliana, with an area under the 
receiver operating characteristic (ROC) curve between 
0.95 and 0.97 for GT, GC, and AG sites (see Materials 
and methods section for details, and ROC curves in Fig- 
ure S9 in Additional file 1). Therefore, the false positives 
identified in Table 3 for the alignments produced by 
TAU and Cufflinks are likely accurate. 

Our use of a filtering step raises the concern that we 
are removing legitimate splice junctions, despite the 
high accuracy of the splice junction classifiers. As 
further validation, we compare the rate at which each 
package recapitulates splice junctions that are observed 
in a large coUection of EST alignments. For A. thaliana 
the ESTs consisted of 1.5 million sequences from the 
National Center for Biotechnology Information (NCBI) 
dbEST database [37] and 71,806 sequences from [35], 
making a total of 1.6 million ESTs that aligned to 4,696 
(16.5%) A. thaliana genes. For V. vinifera we down- 
loaded 352,984 ESTs from the Plant Genome Database 
[38]. These ESTs aligned to 10,168 (43%) V. vinifera 
genes. The ESTs were aligned using a pipeline devel- 
oped in [32], which uses GMAP [39] to align ESTs to a 
reference genome, and then assigns aligned ESTs to 
gene regions and corrects alignment artifacts. The 
results in Table 4 show that the three packages had a 
similar proportion of RNA-Seq junctions that matched 
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Figure 4 Example of a Cufflinks prediction. We provide the predictions made by Cufflinks for tine same gene wliose SpliceGraplier 
predictions are shown in Figure 1. Some of the splice junctions used by Cufflinks are predicted to be false positives by SpliceGrapher's accurate 
splice junction classifiers (red edges in the plot). These lead to detection of questionable AS events. 



junctions from the ESTs in both species. SpUceGrapher 
had the highest proportion of matches in A. thaliana 
and the second-highest in V. vinifera, indicating that its 
predictions were not adversely affected by the splice 
junction filtering step. 

We next illustrate that spliced alignment filtering is 
important, even as read length increases. To do so, we 
generated 41 million 76-nucleotide reads for A, thaliana 
and created a set of shorter reads by truncating the 76- 



nucleotide reads to 32 nucleotides; we then ran TopHat 
on the two sets. We used the same TopHat parameters 
in both cases except for the segment length, which we 
set to 20 nucleotides for the 32-nucleotide reads and 26 
nucleotides for the 76-nucleotide reads. The 76-nucleo- 
tide reads produced more than four times as many 
spliced alignments as their truncated counterparts; the 
shorter reads resulted in a slightly larger number of 
ungapped alignments since their full-length versions 
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Table 3 Comparison of splice junctions identified by each package 



SpliceGrapher 



Supersplat 



TopHat 



A. thaliano 

Canonical junctions (GT-AG/GC-AG) witiiin genes 
Junctions in common 
Novel junctions 

Novel junctions with a false-positive site 
Novel junctions in common 
V. vinifera 

Canonical junctions (GT-AG/GC-AG) within genes 
Junctions in common 
Novel junctions 

Novel junctions with a false-positive site 
Novel junctions in common 



80,421 
4,969 

74,457 
9,831 



84,744 
74,821 
7,255 
3,077 
3,599 

82,281 
70,554 
13,394 
4,040 
8,020 



83,367 
63,710 
14,572 
9,942 
1,982 

65,439 
59,154 
6,307 
1,899 
3,662 



Side-by-side comparison of canonical splice junctions identified by each package, reconciling differences between SpliceGrapher and the other two packages for 
A thaliana (top half) and V. vinifera (bottom half). For each species we show the number of canonical splice junctions each package found recapitulated in the 
RNA-Seq reads. For Supersplat and TopHat we also show the number of junctions each shares with SpliceGrapher and the number of novel Junctions for which 
SpliceGrapher's classifiers identified either the donor site or the acceptor site as a false positive. 



might span splice junctions (see Table S2 in Additional 
file 1 for details). The additional spliced alignments for 
76-nucleotide reads increased the number of novel junc- 
tions more than six-fold, but the proportion of false- 
positive junctions also jumped from 24% to 39%. These 
results demonstrate the value of increased read length 
for sensitivity in splice junction detection, and yet longer 
read length alone does not guarantee accurate splice 
junction predictions. 

Inaccurate spliced-alignment can have a strong impact 
on the accuracy of AS detection. To demonstrate this, 
we looked at the splice junctions predicted by TAU and 
Cufflinks and collected statistics for the AS events asso- 
ciated with those junctions that SpliceGrapher's classi- 
fiers identified as false positives (Table S3 in Additional 
file 1). We identified spurious splice junctions as those 
that contained at least one false-positive splice site. 
With gene models, 24% of the AS events predicted by 
Cufflinks could be attributed to spurious splice junc- 
tions, compared to 59% without gene models. Note that 
the percentages were computed with respect to AS 
events that are not in the TAIR9 annotations, so do not 
depend on the algorithm's access to the annotations. 



Table 4 Recall of splice junctions identified from EST 
data 



Species 


SpliceGrapher 


Supersplat 


TopHat 


A. tholiono 


24,757 


25,590 


22,934 


V. vinifera 


35,403 


37,291 


34,081 



We compared the splice junctions that each package identified in RNA-Seq 
data to splice junctions inferred from ESTs aligned to A. ttialiana and V. 
vinifera. Numbers represent the number of splice junctions from spliced 
alignments of RNA-Seq data that were also identified from alignments of ESTs 
(excluding junctions that are annotated in TAIR9). Despite performing a step 
of filtering of false positives, SpliceGrapher achieves a higher level of recall 
than TopHat, and only slightly lower than SuperSplat. 



With TAU we also saw a decrease in the percentage of 
false positive events when provided with gene models. 

Each of the three packages use somewhat different cri- 
teria for accepting splice-junction alignments. Both 
Supersplat and TopHat allow users to control maximum 
allowed intron length; SpliceGrapher on the other hand 
only accepts splice junctions that are within a single 
gene. We have found many examples where TAU and 
Cufflinks predicted splice junctions that span two genes; 
with TAU for example, we found 309 cases of splice 
junctions that span multiple genes in A. thaliana and 
341 cases in V. vinifera. 

Incorporating ESTs 

To demonstrate SpliceGrapher's ability to incorporate 
multiple forms of data into splice graph predictions, we 
ran it with the A. thaliana RNA-Seq data and provided 
the EST alignments described above. Figure 5 illustrates 
the benefit of combining RNA-Seq data and EST align- 
ments for predicting splice graphs. In this example, 
ESTs provided evidence for an IR event that was not 
detected in the RNA-Seq data because read coverage 
was not consistent across the entire intron. RNA-Seq, 
on the other hand, provides evidence for alternative 3' 
splicing events. The predicted graph then incorporates 
the full range of evidence into a single coherent splice 
graph. 

The addition of EST data increased the number of AS 
events from 7,388 to 9,916 and the number of genes 
where AS is observed from 4,901 to 6,162. The graphs 
augmented with EST alignment information contain 
novel AS events in proportions that were much closer 
to the proportions in the TAIR9 models than graphs 
predicted using RNA-Seq alone. Notably, the proportion 
of IR events is the same as in the gene models. 
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Gene Model for AT1G53400 




ESTs for AT1G53400 
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SpliceGrapher AT1G53400 Prediction 




Alt. 3' Intron Retention 

Figure 5 SpliceGrapher prediction from RNA-Seq and EST data. This example shows how SpliceGrapher can use both RNA-Seq data and 
EST data to produce predictions that incorporate the strengths of each data type. RNA-Seq data provide evidence for two novel splice junctions 
(fourth panel down, highlighted in green) that SpliceGrapher uses to infer an alternative 3' splicing event. EST alignments provide compelling 
evidence for an intron retention event. SpliceGrapher combines these predictions into the final predicted graph. 

V J 



demonstrating the utility of ESTs for detecting IR 
events. Finally, the level of recall of TAIRIO annotation 
has resulted in a large increase with the addition of the 
EST data (Table 2). 

Applying SpliceGrapher to mammalian genomes 

To demonstrate SpliceGrapher's utility with mammalian 
genomes, we used a subset of the human RNA-Seq data 
used in [40] to compare AS in 60 individuals. We 



focused on the two individuals for whom a much larger 
set of reads was available. The data consisted of 74 mil- 
lion paired-end 35-nucleotide reads from two indivi- 
duals: Caucasian (35.7 million reads) and Yoruban (38.3 
million reads). ENSEMBL gene models were used as 
additional input to SpliceGrapher. 

Based on Homo sapiens gene models, SpliceGrapher's 
classifiers achieved ROC scores of 0.97, 0.95 and 0.91 
for GT, AG and GC splice sites, respectively. These 
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classifiers predicted 25.3 million potential splice sites 
out of 105.2 million dimers that occurred within genes. 
Because of the longer gene length in human, the 
approach of creating a splice junction database was not 
feasible. Instead, we used TopHat to perform spliced 
alignments and filtered the results using our predicted 
splice site database. 

We ran TopHat with default parameter settings to 
align both data sets. Our classifiers flagged close to 60% 
of the splice junctions in both datasets as false-positives 
(Table S4 in Additional file 1). This is in comparison to 
14% in Arabidopsis and 58% in grape, which is further 
illustration of the potential pitfalls in splice junction 
alignment of short reads. 

Using the filtered TopHat splice junction alignments, 
we found 1,099 novel AS events in the Caucasian sam- 
ple and 1,154 novel AS events in the Yoruban sample. 
The original paper reported statistics on 110 AS events 
in genes where genetic variation impacted differential 
alternative expression, while our predictions span the 
whole genome. The various types of AS are predicted at 
rates similar to those in the gene models with the nota- 
ble exception of IR, which was relatively hard to predict 
from plant RNA-Seq as well. The number of novel AS 
predictions in the two samples were similar for each AS 
type. Full details are provided in Table S5 in Additional 
file 1; Figure SIO in Additional file 1 shows an example 
where we identified a difference in the observed pattern 
of AS in the two samples. 

Conclusions 

We have presented SpliceGrapher, a Python scripting 
package designed to enhance existing gene annotations 
by predicting splice graphs from RNA-Seq and EST 
data. In addition, SpliceGrapher includes modules for 
identifying AS events and for viewing predicted splice 
graphs along with the evidence used to generate them. 

We compared SpliceGrapher to TAU and Cufflinks; 
unlike these packages, which make somewhat limited 
use of gene models, SpliceGrapher exploits gene annota- 
tions fully to construct its splice graphs. This allows 
SpliceGrapher to make meaningful predictions even for 
genes that have low read coverage, and helps us resolve 
AS events that are otherwise hard to detect from short- 
read data. Prediction of AS requires accurate alignment 
across splice junctions, which is a difficult task when 
read length is short. Our work demonstrates that using 
tools that do not use information about splice site char- 
acteristics can lead to a large number of possibly erro- 
neous alignments. SpliceGrapher addresses this issue by 
using a machine learning approach to construct a data- 
base of potential splice sites that can be used to filter 
the output of a spliced alignment method such as 
TopHat, or to construct a database of potential splice 



junctions that allow the use of standard mapping tools 
that do not perform spliced alignment. The latter 
approach is only feasible for more compact genomes 
like the plant genomes we analyzed here. 

We have demonstrated SpliceGrapher on RNA-Seq 
data from human and two plants. A, thaliana and V, 
vinifera. Plant genomes have different architectures 
from mammalian genomes and are therefore character- 
ized by different types of AS events. In particular, IR is 
rare in mammalian genomes, but is the dominant form 
of AS in plants. Most existing work on predicting AS 
from short-read data has focused on exon skipping in 
mammals [1,18,19,41-44]. We have shown that Splice- 
Grapher is able to predict different forms of AS, 
although the rate at which it detects intron retention is 
lower than its observed frequency in existing annota- 
tions. This is due to the inherent difficulty in predicting 
intron retention from RNA-Seq data, which requires 
consistent read coverage across an entire intron. 

Materials and methods 

RNA-Seq data 

In our experiments, we used publicly available short- 
read sequence data for the plants A, thaliana and V, 
vinifera. We downloaded FASTQ sequence files contain- 
ing 284 million A. thaliana reads (accession number 
[SRA:009031]) and 59 miUion 36-nucleotide V. vinifera 
reads (accession number [SRX:012280]) from the NCBI 
Sequence Read Archive [45]. A, thaliana reads were 
trimmed to 32 nucleotides as recommended in [3]. As 
baseline annotation for A. thaliana we used the TAIR9 
genome annotations [46]; the newer TAIRIO annota- 
tions were used for evaluation. For V, vinifera we used 
the Genoscope version 2 annotations and sequences, 
using only those sequences for the 19 best-characterized 
chromosomes. Additional RNA-Seq data were generated 
for A. thaliana as follows. Total RNA from 2-week-old 
A, thaliana (ecotype Columbia) seedlings grown on MS 
(Murashige and Skoog medium) plates was isolated 
using RNeasy Plant Mini Kit from Qiagen (Valencia, 
CA, USA). To remove any contaminating DNA, RNA 
was treated with DNAse [47]. Isolation of poly (A) 
mRNA and preparation of a cDNA library were carried 
out using the lUumina TrueSeq RNA kit. Sequencing 
(72 cycle) was done on an Illumina Genome Analyzer II. 
This dataset, which altogether has 41 million reads, has 
been deposited in NCBI's Gene Expression Omnibus 
[48], and is accessible through GEO series accession 
number [GSE:32318]. 

The SpliceGrapher pipeline 

SpliceGrapher's pipeline for predicting splice graphs 
uses several forms of data. Gene models are loaded 
directly into a splice graph that serves as a baseline for 
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interpreting the RNA-Seq and EST data. Data from 
RNA-Seq experiments is aligned to the reference gen- 
ome; ungapped alignment is carried out first, and all 
unaligned reads undergo spliced alignment, to infer 
reads that span splice junctions. SpliceGrapher can 
accept short-read alignments provided by a variety of 
tools [9-12,14-17]. EST and cDNA sequences are aligned 
using conventional sequence alignment tools [39,49] 
before SpliceGrapher includes the alignments in a splice 
graph. As SpliceGrapher loads each form of data, it inte- 
grates the new evidence with the gene models to con- 
struct a splice graph. 

Splice graph construction 

Each of the data sources used by SpliceGrapher-genome 
annotations, short-read data, and EST alignments- 
requires a distinct interpretation for spUce graph con- 
struction. For the following discussion of the procedures 
used to interpret each type of data we borrow terminol- 
ogy from [29] and refer to exons that have explicit 
acceptor and donor splice sites as 'bounded' and those 
with an undefined acceptor or donor splice site as 
unbounded'. When one graph element, such as an exon 
or an intron, falls completely within the genomic coor- 
dinates of another graph element, we say that it is 'con- 
tained' within the other element. 
Gene models 

SpliceGrapher accepts gene model annotations in GFF3 
format [50] and constructs graphs using the sequence 
coordinates found in UTR, CDS and exon records. It 
interprets exon records as bounded exons and incorpo- 
rates them directly into a splice graph. SpliceGrapher 
infers an intron whenever the corresponding exons are 
adjacent in a transcript. When an exon appears in mul- 
tiple transcripts, a single exon node is created along 
with edges that link it to the exons that follow or pre- 
cede it in the corresponding transcripts. 
RNA-Seq data 

To incorporate RNA-Seq data into a splice graph, Spli- 
ceGrapher loads read alignments in SAM format [51] 
for all reads that map within the boundaries of some 
annotated gene. It then constructs exons (hereafter 
referred to as 'short-read exons') from clusters of contig- 
uous ungapped alignments where the read depth 
remains above a minimum threshold (the default is 0). 
A short-read exon is discarded if it is contained within 
an existing exon. If a short-read exon does not extend 
exactly to known or predicted acceptor or donor splice 
sites, SpliceGrapher will extend it to the nearest splice 
site that is known or has support from spliced reads. 
EST alignments 

Because they are longer than RNA-Seq reads, ESTs pro- 
vide more reliable transcriptional evidence. In our fra- 
mework we run EST alignments through a series of 



processing steps designed to remove alignment artifacts 
[32] and convert the alignments directly to splice-graphs 
using the same procedures we developed for gene mod- 
els. Exons at the 3' or 5' end of an EST are considered 
unbounded. These are merged with other exons in the 
graph according to the following rules. An exon that is 
unbounded at its 3' end is merged with another exon if 
they share the same acceptor site and both exons are 
unbounded at the 3' end. If one of the exons is 
bounded, the exons will be merged only if the 
unbounded exon is contained within the bounded exon. 
Analogous rules apply to exons unbounded at the 5' 
end. SpliceGrapher infers an intron between exons 
whenever they are adjacent in an EST. 

Alternative splicing inference from RNA-Seq 

Because of their short length, RNA-Seq data cannot be 
unambiguously interpreted as splice-graphs and AS 
events (Figure 3). SpliceGrapher's approach is to use as 
much data as possible to make confident predictions, 
and to annotate AS events as unresolved if the evidence 
does not clearly support a specific isoform. SpliceGra- 
pher applies inference rules in the order presented in 
the following sections. 
Intron retention 

IR is arguably the most challenging form of AS to infer 
from RNA-Seq data. SpliceGrapher infers IR events 
from RNA-Seq evidence in two ways that exploit infor- 
mation from the gene models. When short-read cover- 
age remains above a desired threshold across an intron's 
full length, it is evidence that the intron was retained in 
some transcripts (Figure 1). In this case the intron is 
excised in the constitutive form represented by the gene 
model. In an alternative scenario shown in Figure Sll in 
Additional file 1, the intron is retained in the known 
constitutive form. We detect this form of IR when a 
known exon has a novel splice junction within it. 

When SpliceGrapher infers a novel IR event, it must 
identify unique exon boundaries for three exons: the 
longer exon in which the intron is retained, and the two 
shorter exons that flank the intron when it is excised. 
Usually the gene model provides good evidence for 
these boundaries, but in some cases it may not be possi- 
ble to resolve them unambiguously. 

When short-read coverage remains high across an 
intron's full length, SpliceGrapher will create a short- 
read exon that spans the intron. Its next task is to deter- 
mine the exon's correct boundaries. SpliceGrapher first 
finds all exons in the graph that overlap the short-read 
exon. If one upstream exon overlaps its 5' end and one 
downstream exon overlaps its 3' end, SpliceGrapher cre- 
ates a new exon whose boundaries are the acceptor site 
from the upstream exon and the donor site from the 
downstream exon. If more than one exon overlaps either 
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end of the short-read exon, it is still possible to infer the 
new exon's boundaries provided all overlapping 
upstream exons share the same acceptor site and over- 
lapping downstream exons share the same donor site 
(Figure S12 in Additional file 1). If the boundary at 
either end is ambiguous, SpliceGrapher creates an unre- 
solved IR event. When a junction is used to infer an IR 
event through the scenario of a splice junction within 
an exon, SpliceGrapher must identify the boundaries for 
two new exons, which is performed in a manner analo- 
gous to the single exon case. 

In some cases read coverage may remain high across 
two or more introns in succession, making it impossible 
to determine which of several possible splice forms is 
correct (Figure 3). In these cases, SpliceGrapher anno- 
tates the corresponding short-read exon as unresolved. 
Alternative 3' and 5' events 

When an intron is excised at more than one splice site, 
changing the boundaries of one of its flanking exons, we 
have evidence of an alternative 3' or 5' event. SpliceGra- 
pher uses two forms of evidence to infer alternative 375' 
events. When a short-read exon overlaps an existing 
exon but extends beyond its 3' or 5' end, it provides evi- 
dence for an alternative donor or acceptor site. In addi- 
tion, when a novel splice junction appears between two 
exons it provides evidence for a novel intron (Figure 
S12 in Additional file 1). SpliceGrapher requires both 
forms of evidence to infer a novel alternative splice site. 
Below we describe the procedure for inferring an alter- 
native acceptor (3' site). The procedure for an alterna- 
tive donor (5' site) is analogous. 

When a short-read exon overlaps an existing exon and 
extends into its upstream intron, it is evidence that the 
exon boundaries changed in some transcripts. To iden- 
tify an acceptor site for the new exon, SpliceGrapher 
looks for junctions that have acceptor sites within the 
same intron, upstream of the short-read exon (Figure 
S12 in Additional file 1). SpliceGrapher then uses the 
acceptor site nearest the short-read exon as its acceptor 
site. If it finds no acceptor sites within the intron, Spli- 
ceGrapher annotates the short-read exon as unresolved. 

If SpliceGrapher can resolve a new exon's acceptor 
site, it must resolve its donor site as well. The procedure 
is the same as that for identifying retained intron 
boundaries in the previous section. If one downstream 
exon overlaps the new exon's 3' end, SpliceGrapher uses 
the downstream exon's donor site as the new exon's 
donor site. If more than one downstream exon overlaps 
the new exon's 3' end, SpliceGrapher can still resolve its 
donor site provided all overlapping exons share the 
same donor site (Figure S12 in Additional file 1). If the 
overlapping exons have different donor sites, SpliceGra- 
pher cannot resolve the new exon's donor, and it anno- 
tates the exon as unresolved. 



Exon skipping 

An exon skipping event occurs when an exon is 
excised from some transcripts but included in others. 
SpliceGrapher infers skipped exons in two different 
ways. In the first scenario the exon is included in the 
known constitutive form represented in the gene 
model. If a novel splice junction spans the existing 
exon, it is evidence that the exon was skipped in some 
transcripts (see top panel of Figure S13 in Additional 
file 1). If the novel junction's acceptor and donor sites 
match those of established exons, SpliceGrapher adds 
the new intron to the graph and annotates the skipped 
exon. 

An alternative scenario is when the exon is skipped in 
the constitutive form (bottom panel of Figure S13 in 
Additional file 1). In this case, if a short-read exon faUs 
within an intron and is flanked by two novel junctions, 
it is evidence for a novel exon that is skipped in some 
transcripts. These clues may not provide enough evi- 
dence to resolve the event, so SpliceGrapher tries to 
associate the upstream junction's donor site and the 
downstream junction's acceptor site with exons in the 
graph. If this first step is successful, SpliceGrapher uses 
the upstream junction's acceptor site as the new exon's 
acceptor site and the downstream junction's donor site 
as the new exon's donor site. If it is unable to resolve 
the junctions, SpliceGrapher annotates the short-read 
exon as unresolved. 

Splice graph assembly 

The above rules are used to infer exons from gene mod- 
els, ESTs and short-read data. The final step consists of 
adding edges (introns) that connect consecutive exons. 
For each newly predicted exon, SpliceGrapher looks for 
other exons in the graph that use the same acceptor 
site. If any of these exons has an intron from its accep- 
tor site to a neighboring exon, SpliceGrapher adds an 
edge from the neighboring exon to the new exon. It 
repeats the process for all exons that have the same 
donor site. SpliceGrapher uses an analogous procedure 
to infer introns for the donor site of a new exon. Graphs 
are stored using a GFF file format. 

Splice graph comparisons 

TAU and Cufflinks predict transcripts that we convert 
into splice graphs associated with annotated genes. As 
de novo prediction tools, both TAU and Cufflinks make 
predictions without regard to known gene boundaries. 
Thus, to perform a meaningful comparison we associate 
each transcript with one or more genes that it overlaps. 
Once transcripts are associated with annotated genes, 
they are converted to splice graphs using the same pro- 
cedure SpliceGrapher uses to assemble splice graphs 
from ESTs. 
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In order to compare a predicted splice graph with the 
splice graph generated from genome annotations, we 
compare the number of exons and introns from the 
gene model that are recapitulated. This is quantified 
using the 'recall' statistic. The recall of a set A with 
respect to a set B is defined as: 



We define R{A\B) = 1 when B is the empty set. To 
compare two splice graphs we compute recall for exons 
and introns separately. 

Short read alignment 

SpliceGrapher includes modules that allow it to incorpo- 
rate RNA-Seq data into splice graph predictions. It 
accepts short read alignments in SAM format [51]. For 
this project we used PASS [10] to perform short-read 
alignments as we found it to be fast and accurate in pre- 
liminary tests on synthetic data. 

To provide high-confidence splice graph predictions, 
we enforced strict criteria for accepting alignments. We 
accepted only reads that aligned to the genomic refer- 
ence with 100% identity at a unique location. To 
increase confidence in these alignments, we also 
required that a read align nowhere else in the genome 
at 90% identity or above (a procedure adapted from 
[13]). We used uniquely aligned reads to identify exonic 
regions, and unaligned reads for splice-junction 
alignments. 

Splice junction reads 

SpliceGrapher performs spliced alignment by first con- 
structing a database of splice junction sequences that 
are formed by concatenating the sequence directly 
upstream of a donor site with the sequence directly 
downstream of an acceptor site. We distinguish three 
kinds of splice junctions: 'known' junctions that are 
derived from gene model annotations; 'recombined' 
junctions constructed from novel combinations of 
known splice sites; and 'predicted' junctions in which 
one or both splice sites are novel. SpliceGrapher con- 
structs junction sequences by pairing every known and 
predicted donor site in a gene with all known and pre- 
dicted acceptor sites downstream of it in the same gene. 
This procedure has been used in other studies (see, for 
example, [1,18,19]) to construct a database of known 
and recombined splice junctions from known splice 
sites. SpliceGrapher extends this procedure to include 
predicted sites as well. 

To avoid spurious alignments, we require that junc- 
tion-crossing reads align with a minimum 10-nucleotide 
overlap on either side of a junction. Adopting 



nomenclature from [14], we refer to the 10-nucleotide 
region on either side of a junction as the anchor' region. 
For reads of length n and a required overlap size p, we 
can enforce this constraint by generating splice junction 
sequences of length 2{n-p), For example, for 32-nucleo- 
tide reads and a required 10-nucleotide minimum over- 
lap, we generated 44-nucleotide sequences (22 
nucleotides on either side of a junction). To improve 
sensitivity, we accepted alignments with 90% identity 
overall, provided they aligned with 100% identity within 
the anchor region. 

The three pipelines each used slightly different align- 
ment criteria. Our SpliceGrapher pipeline allowed mis- 
matches in exonic alignments so that we could identify 
and eliminate reads that aligned well to multiple loca- 
tions. We also accepted mismatches in splice-junction 
alignments to improve sensitivity, provided there were 
no mismatches within anchor regions. We used TopHat 
parameters that allowed us to duplicate our own align- 
ment criteria, though TopHat accepts mismatches 
within splice-junction anchor regions. Supersplat accepts 
only reads that align with 100% identity within anchor 
regions. 

Splice site prediction 

We classify splice sites using support vector machines, 
using an approach that has been shown to be highly 
accurate in splice site prediction [12,15,52,53]. To create 
splice site classifiers, SpliceGrapher extracts positive and 
negative example sequences for splice site donor and 
acceptor dimmers such as GT, GC and AG following 
the procedure described in [52]. Briefly, we use known 
splice sites as positive examples for a given dimer and 
for negative examples we use all other occurrences of 
the dimer found within genes. Training examples then 
consist of intronic and exonic sequences taken from 
either side of a site. 

SpliceGrapher's classifiers discriminate sequences 
using an implementation of the weighted-degree kernel 
[52]. This kernel represents a sequence in a feature 
space of k-mers associated with positions in the 
sequence. Kernel parameters include exon and intron 
sequence lengths on either side of a splice site, k-mer 
length, number of mismatches to allow within a k-mer, 
and whether to allow shifts in k-mer position (for a 
detailed overview, see [54]). SpliceGrapher iterates over 
combinations of these parameters to identify the best- 
performing combination. It uses the PyML package [55] 
to train and test these support vector machines using 
the training examples described above. SpliceGrapher 
executes this procedure for all given acceptor and donor 
dimers to yield a classifier for each one. It then applies 
each classifier to corresponding dimers in the genomic 
reference sequences. Locations classified as splice sites 
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contribute a pool of predicted sites that SpliceGrapher 
combines with known splice sites when it generates 
splice junction sequences. The Palmapper package [15] 
uses a similar set of classifiers, but evaluates acceptor 
and donor sites during spliced alignment instead of 
creating a splice junction database. 

In predicting splice sites we focused on canonical GT 
and GC donor sites and AG acceptor sites. For A. thali- 
ana, we used SpliceGrapher with TAIR9 annotations to 
extract 122,534 annotated GT, GC and AG splice sites. 
SpliceGrapher created splice site classifiers for GT and 
GC donor sites and AG acceptor sites. These classifiers 
achieved ROC scores of 0.97, 0.97 and 0.95 for GT, GC 
and AG sites, respectively, in five-fold cross-validation. 
ROC curves are provided in Figure S9 in Additional file 
1. These classifiers predicted novel splice sites for 
878,994 out of 9,753,440 dimers (9%) found within 
genes in the TAIR9 reference sequences. Combining 
these predicted sites with known splice sites, we gener- 
ated a splice junction database with 8.2 million junc- 
tions, out of which 122,534 are known and 586,704 are 
recombined. 

For V, vinifera, SpliceGrapher extracted 125,080 anno- 
tated GT, GC and AG splice sites from the gene models. 
The splice-site classifiers achieved ROC scores of 0.91, 
0.81 and 0.88 for GT, GC and AG sites, respectively, in 
five-fold cross-validation. As V, vinifera is not as well 
annotated as A, thaliana, we decided to use the EST 
alignments described earlier to obtain better splice site 
models. We used SpliceGrapher to extract 84,811 splice 
sites from these alignments, and used them to improve 
our splice site models. The resulting classifiers achieved 
ROC scores of 0.98, 0.92 and 0.96 for GT, GC and AG 
sites, respectively, considerably higher than those based 
on the gene models. The ROC curves are shown in Fig- 
ure S9 in Additional file 1. These classifiers predicted 
2.2 million putative splice sites out of 11.2 million 
dimers (20%) within genes in the reference sequences. 
The resulting splice junction database contains 91 mil- 
lion splice junction sequences, out of which 125,080 are 
known, and 616,314 are recombined. 

HashMatch/Supersplat/TAU 

We ran HashMatch and Supersplat [17] on the two data 
sets, following the procedures outlined in [3]. We first 
used HashMatch to perform ungapped read alignment. 
We then perform spliced alignment with Supersplat 
using those reads that did not result in a match in 
ungapped alignment. Supersplat performs alignment 
without regard for splice site dimers, so for a more rea- 
listic comparison we accepted only alignments that 
spanned canonical dimers. Supersplat accepts minimum 
and maximum intron lengths and anchor region size. It 
is prudent to select conservative intron sizes to prevent 



TAU from generating an inordinate number of tran- 
scripts. For A, thaliana we established an intron size 
range of 40 to 5,000, as 99.9% of known introns fell 
within this range. For V. vinifera the same criterion 
yielded an intron size range of 55 to 15,000. To fix a 
lower bound for overlaps on either side of a junction, 
we set the minimum anchor size to 10, and accepted 
only reads that aligned to a unique splice junction. All 
experiments used Supersplat version 1.0 and TAU ver- 
sion 1.4 (no version information was available for 
HashMatch). 

BowTie/TopHat/Cufflinks 

TopHat and Cufflinks were designed for mammalian 
genomes and thus rely on some heuristics that are 
based on mammalian gene expression statistics and 
gene architecture. For example, TopHat's heuristic filter 
for spliced alignments is based on the observation that, 
in humans, minor splice forms usually have expression 
levels that are at least 15% as high as those of their cor- 
responding major splice forms [14]. Another heuristic 
embedded into TopHat is to report only alignments 
across GT-AG introns for reads shorter than 75 
nucleotides. 

We ran TopHat on the A, thaliana data using para- 
meters that reflected the requirements we set for our 
own alignments: no multi-hits (-g 1), minimum anchor 
length 10 (-a 10), and minimum and maximum intron 
length 40 and 5,000, respectively (55 and 15,000 for V. 
vinifera), TopHat splits reads into segments for part of its 
search. To permit reads to align at 90% identity, we set 
the segment length to 20 (-segment-length = 20) and 
allowed up to 2 mismatches per segment (-segment-mis- 
matches = 2). To eliminate the heuristic filter associated 
with mammalian genomes, we set the minimum normal- 
ized depth to 0 (-F 0). To make our comparison as fair as 
possible, we also ran TopHat with -segment-mismatches 
= 0 to force it to accept only alignments with 100% iden- 
tity. We then approximated our own alignment criteria 
by using TopHat's ungapped alignments at 100% identity 
and its spliced alignments at 90% identity. The remaining 
difference between TopHat's splice-junction alignments 
and ours was a subset of TopHat read alignments that 
contained mismatches in the anchor region. All experi- 
ments used Bowtie version 4.1.2, TopHat version 1.3.3 
and Cufflinks version 1.1.0. 

Additional material 

^ 

Additional file 1: Supplementary information. Includes a section on 
validating SpliceGrapher predictions with evidence from other 
experiments, plus supplementary tables and figures. 
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